5.  INTRODUCTION

This research is to develop a large database of digitized mammograms that will be
distributed free of charge to interested researchers. It is being funded by the USAMRMC as an
infrastructure award and as such does not represent a research project per se. That is, there
is no hypothesis that we are trying to prove. Therefore, this report is structured slightly differently
from a normal scientific research report: heavy on the method and light on actual results. In this
project, the procedure is the most important component; it is applied continuously in a
straightforward manner to achieve the goal of creating the database of mammograms.

5.1.  Nature  of  the  Problem

In 1992, the National Cancer Institute identified digital mammography as an important area
of research for reducing breast cancer mortality. [1] As a result, there has been a sharp increase in
the  number  of  researchers  developing  computerized  methods  for  analyzing  mammograms.  This  is 
due  in  part  to  the  substantial  potential  benefit  from  developing  an  automated  computerized  system 
for  assisting  radiologists  in  interpreting  mammograms.  With  a  large  number  of  investigators 
developing  computerized  analysis  techniques,  the  likelihood  of  an  accurate  method  being 
developed  is  high.  Unfortunately,  a  major  obstacle  to  rapid  progress  in  developing  a  technique  is 
that  each  investigator  uses  his  or  her  own  set  of  mammograms  (database)  to  develop  and  evaluate 
the  performance  of  his  or  her  technique.  As  a  result,  it  is  not  possible  to  compare  the  accuracy  of 
different  methods  because  the  measured  performance  is  dependent  on  the  cases  used  for  testing.  [2] 
For  example,  by  using  "easy"  cases  for  testing,  a  computer  technique  would  apparently  have  a 
higher  accuracy  than  if  "hard"  cases  were  used.  A  common  database  of  mammograms  that  could 
be  used  by  all  investigators  in  the  field  would  solve  this  problem. 

5.2.  Background:  Previous  work  in  the  field 

At  a  Biomedical  Image  Processing  meeting  held  February  1993,  in  San  Jose  CA,  12 
panelists  discussed  the  design  of  a  common  database  for  research  in  mammographic  image 
analysis. [3] Two of the panelists are investigators on this proposal. Important considerations
in developing the database are: (a) the cases selected, (b) the digitizer used, (c) organization of the
database, (d) associated information to be included with images, (e) "truth" for each case, (f)
format of image files, (g) distribution of the database, and (h) rules on using the database.

There have been several small databases released for general use. However, all have
limitations due to insufficient spatial resolution, insufficient grey-scale resolution, and/or
too small a number of cases. The database that we are developing will have none of these
limitations.

5.3.  Purpose 

The  purpose  of  this  proposal  is  to  develop  a  database  of  digital  mammograms  that  can  be 
used  by  researchers  who  (1)  are  trying  to  determine  the  image  quality  requirements  of  detectors  for 
digital  mammography;  (2)  are  developing  image  processing  techniques  to  optimize  the  displayed 
digital  mammogram;  (3)  are  developing  computerized  methods  for  analyzing  mammograms;  (4)  are 
studying  the  effects  of  image  compression  methods  on  image  quality;  (5)  are  developing  methods 
for  remote  transmission  of  mammograms;  and  (6)  are  studying  the  relationship  between  image 
quality  and  diagnostic  accuracy.  This  database  also  could  be  used  as  a  resource  for  teaching 
radiology  residents  and  for  testing  the  performance  levels  of  mammographers. 

The  specific  aims  of  this  proposal  are: 

1. Collect and digitize 200 cases in each of 5 different categories, mammograms exhibiting: (i)
clustered microcalcifications, (ii) masses, (iii) architectural distortions, (iv) asymmetric densities,
and (v) no lesions (i.e., normals).

2. Make these cases available to other researchers either over a computer network (Internet) or by
sending images on computer tape or CD. The database will be distributed as widely as possible so
that comparisons of different computerized analysis techniques can be standardized.


5.4.  Method  of  Approach

Task  1:  Collect  and  digitize  mammograms.  Months  1-48.  (See  Figure  1.) 

a.  Retrieve  from  film  library  cases  with  pathologically-proven  lesions  (clustered 
microcalcifications,  breast  masses,  architectural  distortion,  asymmetric  densities),  100 
cases  of  each  type  and  100  normals  (cases  without  lesions)  from  each  site  [University  of 
Chicago  (UC)  and  University  of  North  Carolina  (UNC)]  for  a  total  of  1000  cases  during 
the  entire  funding  period. 

b.  At  each  site,  digitize  retrieved  films  and  outline  the  location  of  the  lesion  in  each  abnormal 
image.  The  outline  will  be  stored  together  with  the  images  but  in  a  separate  file. 

c. Send normal cases and asymmetric density cases that were digitized at UC to UNC; and
send cases containing masses, microcalcifications, and architectural distortion that were
digitized at UNC to UC.

d.  Selectively  randomize  200  cases  for  each  lesion  type  into  one  of  two  sets  (training  and 
testing),  based  on  lesion  subtlety.  Similarly,  selectively  randomize  200  normal  cases  into 
two  sets  based  on  breast  density. 

e.  Place  testing  set  in  off-line  storage  and  training  cases  in  on-line  storage. 

f. On average, 250 cases (2500 images; see text for details) will be done per year for 4 years,
for a total of 1000 cases (10,000 images).
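
The selective randomization in step d can be read as a stratified random split: cases are grouped by a rating (lesion subtlety for abnormal cases, breast density for normals) and each group is divided at random between the training and testing sets. The following is a minimal sketch under that reading, not the procedure actually used; the function name, the record fields, the fixed seed, and the 50/50 division are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_stratum(cases, key, seed=0):
    """Split cases into (training, testing), balancing each stratum.

    cases: list of dicts; key: field to stratify on (hypothetical names,
    e.g. "subtlety" for lesions or "density" for normals).
    """
    rng = random.Random(seed)            # fixed seed keeps the split reproducible
    strata = defaultdict(list)
    for case in cases:
        strata[case[key]].append(case)

    training, testing = [], []
    for level in sorted(strata):
        group = strata[level]
        rng.shuffle(group)               # randomize within the stratum
        half = len(group) // 2
        training.extend(group[:half])
        testing.extend(group[half:])     # an odd case out goes to testing
    return training, testing


# Hypothetical example: 200 cases, 40 at each subtlety rating from 1 to 5.
cases = [{"id": i, "subtlety": 1 + i % 5} for i in range(200)]
training, testing = split_by_stratum(cases, "subtlety")
```

Splitting within each rating keeps the two sets comparable in difficulty, which matters because measured performance depends on the cases used for testing.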

Task  2:  Establish  protocol  for  transmitting  database.  Months  1-24 

a. Test protocols for different modes of transferring data between UNC and UC (FTP,
8-mm tape, and CD). A data structure designed for portability will be provided to contain
the patient text data; this data structure will be made available along with the data to the
requesting sites. Use of the ACR/NEMA DICOM protocol will be investigated and
incorporated as an optional transfer mechanism.

Task  3:  Maintain  database  and  distribute  cases.  Months  12-48.

a. Maintain computer, jukebox, and network connection, including bug fixes and installation of
vendor software updates.

b.  Distribute  cases  via  computer  network  and  by  mass  storage  media  (tape  or  CD)  as 
requested. 
Figure 1. A flowchart of the steps required to collect, digitize, archive, and distribute
the mammographic database. The 'Full Image' is the whole digitized mammogram at full
resolution. The 'Reduced Image' is a minified version (reduced resolution) of the full
image. The 'ROI Image' is a portion of the full image at full resolution.
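
The relationship between the three image types in Figure 1 can be illustrated with a short sketch. It assumes an image held as a list of pixel rows and assumes block averaging as the minification method; the report does not say how the reduced image is computed, and the function names and the tiny synthetic image are hypothetical.

```python
def reduce_image(full, factor):
    """Minify by averaging non-overlapping factor x factor blocks (assumed method)."""
    rows, cols = len(full), len(full[0])
    reduced = []
    for r in range(0, rows - rows % factor, factor):
        out_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [full[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) // len(block))   # integer mean of the block
        reduced.append(out_row)
    return reduced


def extract_roi(full, top, left, height, width):
    """Crop a region of interest, keeping full resolution."""
    return [row[left:left + width] for row in full[top:top + height]]


# Synthetic 8 x 8 "full image" whose pixel value encodes its position.
full = [[r * 8 + c for c in range(8)] for r in range(8)]
reduced = reduce_image(full, 2)          # 4 x 4, one quarter of the pixels
roi = extract_roi(full, 2, 3, 4, 2)      # 4 rows x 2 columns, original pixels
```

The reduced image trades resolution for size, while the ROI keeps every original pixel within a smaller area.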

6.  PROGRESS  TO  DATE

Task  1. 


We have retrieved, digitized, classified, and filed 178 cases, 111 from the University of
Chicago (UC) and 67 from the University of North Carolina (UNC), as of October 1, 1995.
These include lesions from all categories, with the majority being masses and microcalcifications.
See Table I. The image subtlety has been ranked on a 5-point scale (1-5), with 1 being the most
difficult to detect. All cases are archived on 8-mm tape.

The computer systems that will hold the database have been purchased and installed, one
at each site. The current capacity of each system is approximately 40 gigabytes. The total
capacity of each system will be increased in the third year of the project.


Task  2. 


At this point, we have not transferred data between the two sites. This will be done in the
second year, when a larger number of cases has been collected. We originally considered the
ACR/NEMA (DICOM) image format for our database. However, the ACR/NEMA format does
not have a module for mammography, and it would be an extensive project to develop one at this
time. For now, we are storing the images as a binary array of numbers with a simple
512-byte header. When an ACR/NEMA mammography module becomes available, it will be easy to
convert our files to that format.
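
A file of this kind, a fixed-size header followed by a raw pixel array, can be read with a few lines of code. The sketch below is illustrative only: the report gives the header size (512 bytes) but not its internal layout, the pixel depth, or the byte order, so the image dimensions and the pixel format (little-endian 16-bit unsigned here) are assumptions supplied by the caller, and the function name is hypothetical.

```python
import io
import struct

HEADER_SIZE = 512  # from the text; the header's internal layout is not described


def read_image(stream, rows, cols, fmt="<H"):
    """Read a raw pixel array that follows a fixed-size header.

    rows, cols, and fmt (byte-order character plus a struct code for one
    pixel) are assumptions; the header is returned unparsed.
    """
    header = stream.read(HEADER_SIZE)
    if len(header) != HEADER_SIZE:
        raise ValueError("file is shorter than the header")
    n_bytes = rows * cols * struct.calcsize(fmt)
    data = stream.read(n_bytes)
    if len(data) != n_bytes:
        raise ValueError("pixel data is truncated")
    order, code = fmt[0], fmt[1:]
    flat = struct.unpack(f"{order}{rows * cols}{code}", data)
    pixels = [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]
    return header, pixels


# Round trip on a synthetic 2 x 3 image with an all-zero header.
raw = bytes(HEADER_SIZE) + struct.pack("<6H", 0, 1, 2, 4095, 7, 9)
header, pixels = read_image(io.BytesIO(raw), rows=2, cols=3)
```

Because the pixel data is an uninterpreted array, converting to another format later mainly means mapping the header fields to that format's metadata.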


Task  3. 


Maintenance and distribution of the database are minimal at present.
These tasks will become important in the next and subsequent years as cases go "on-line".


7.  CONCLUSIONS 

The development of a common database of mammograms for digital mammography research
is well underway. The first release of a portion of the database is anticipated for the end of 1995.
This release will include 100 cases of clustered microcalcifications.

A  database  of  mammograms  would  also  be  useful  for  investigators  doing  research  in  other 
areas  of  digital  mammography,  such  as  x-ray  detector  development,  telemammography,  image 
compression,  and  image  processing.  For  example,  questions  such  as  the  required  spatial 
resolution  of  a  digital  mammogram  can  be  answered  in  part  by  conducting  observer  studies  using 
the  mammograms  from  the  database  displayed  at  different  resolutions.  Furthermore,  the  database 
would  provide  an  excellent  source  of  cases  that  could  be  used  for  teaching  purposes. 

Table I. Breakdown of cases in the database as of October 1, 1995.

Type of Lesion             Pathology    Subtlety    # of Cases
Mass                       Malignant    1           17
Mass                       Benign       1           2
Mass                       Malignant    2           12
Mass                       Benign       2           8
Mass                       Malignant    3           11
Mass                       Benign       3           12
Microcalcifications        Malignant    1           14
Microcalcifications        Benign       1           13
Microcalcifications        Malignant    2           18
Microcalcifications        Benign       2           12
Microcalcifications        Malignant    3           16
Microcalcifications        Benign       3           13
Asymmetric Density         Malignant    1           7
Asymmetric Density         Benign       1           0
Asymmetric Density         Malignant    2           6
Asymmetric Density         Benign       2           1
Asymmetric Density         Malignant    3           3
Asymmetric Density         Benign       3           1
Architectural Distortion   Malignant    1           5
Architectural Distortion   Benign       1           1
Architectural Distortion   Malignant    2           3
Architectural Distortion   Benign       2           1
Architectural Distortion   Malignant    3           1
Architectural Distortion   Benign       3           0
Normal                     -            -           1
TOTAL                                               178
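
The per-row counts in Table I can be tallied by lesion type with a short script. This is only a convenience sketch; the row values are copied from the table, and the variable names are hypothetical.

```python
from collections import Counter

# (lesion type, pathology, subtlety, number of cases), copied from Table I
ROWS = [
    ("Mass", "Malignant", 1, 17), ("Mass", "Benign", 1, 2),
    ("Mass", "Malignant", 2, 12), ("Mass", "Benign", 2, 8),
    ("Mass", "Malignant", 3, 11), ("Mass", "Benign", 3, 12),
    ("Microcalcifications", "Malignant", 1, 14), ("Microcalcifications", "Benign", 1, 13),
    ("Microcalcifications", "Malignant", 2, 18), ("Microcalcifications", "Benign", 2, 12),
    ("Microcalcifications", "Malignant", 3, 16), ("Microcalcifications", "Benign", 3, 13),
    ("Asymmetric Density", "Malignant", 1, 7), ("Asymmetric Density", "Benign", 1, 0),
    ("Asymmetric Density", "Malignant", 2, 6), ("Asymmetric Density", "Benign", 2, 1),
    ("Asymmetric Density", "Malignant", 3, 3), ("Asymmetric Density", "Benign", 3, 1),
    ("Architectural Distortion", "Malignant", 1, 5), ("Architectural Distortion", "Benign", 1, 1),
    ("Architectural Distortion", "Malignant", 2, 3), ("Architectural Distortion", "Benign", 2, 1),
    ("Architectural Distortion", "Malignant", 3, 1), ("Architectural Distortion", "Benign", 3, 0),
    ("Normal", None, None, 1),
]

by_type = Counter()
for lesion, _pathology, _subtlety, count in ROWS:
    by_type[lesion] += count          # accumulate cases per lesion type

total = sum(by_type.values())
for lesion, count in by_type.items():
    print(f"{lesion}: {count}")
print(f"All cases: {total}")
```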


